(19) Europäisches Patentamt
     European Patent Office
     Office européen des brevets

(11) EP 0 945 788 A2

(12) EUROPEAN PATENT APPLICATION

(43) Date of publication: 29.09.1999 Bulletin 1999/39

(51) Int Cl.: G06F 9/38

(21) Application number: 99200311.1

(22) Date of filing: 03.02.1999

(84) Designated Contracting States:
     AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE
     Designated Extension States:
     AL LT LV MK RO SI

(30) Priority: 04.02.1998 US 73668 P

(71) Applicant: TEXAS INSTRUMENTS INCORPORATED
     Dallas, TX 75265 (US)

(72) Inventors:
     • Hocevar, Dale E.
       Richardson, Texas 75074 (US)
     • Gatherer, Allan
       Richardson, Texas 75082 (US)
     • Lemonds, Karl E. Jr.
       Garland, Texas 75044 (US)
     • Hung, Ching-Yu
       Plano, Texas 75024 (US)

(74) Representative: Holt, Michael
     Texas Instruments Ltd.,
     PO Box 5069
     Northampton, Northamptonshire NN4 7ZE (GB)

Remarks:
     The references to the drawings figure 11 and figure 12 are deemed
     to be deleted (Rule 43 EPC).

(54) Data processing system with digital signal processor core and co-processor 



(57) A data processing system includes a digital signal processor
core (110) and a co-processor (140). The co-processor (140) has a
local memory (141, 145, 147) within the address space of the said
digital signal processor core (110). The co-processor (140) responds
to commands from the digital signal processor core (110). A direct
memory access circuit (120) autonomously transfers data to and from
the local memory (141, 145, 147) of the co-processor (140).
Co-processor commands are stored in a command FIFO memory (141)
mapped to a predetermined memory address. Control commands include a
receive data synchronism command stalling the co-processor (140)
until completion of a memory transfer into the local memory (141,
145, 147). A send data synchronism command causes the co-processor
(140) to signal the direct memory access circuit (120) to trigger
memory transfer out of the local memory (141, 145, 147). An interrupt
command causes the co-processor (140) to interrupt the digital signal
processor core (110).
Description

TECHNICAL FIELD OF THE INVENTION 

[0001] The present invention relates generally to the field of
digital signal processing, and more particularly to a digital signal
processor with a core data processor and a reconfigurable
co-processor.

BACKGROUND OF THE INVENTION 
[0002] Digital signal processing is becoming more and more common
for audio and video processing. In many instances a single digital
processor can replace a host of prior discrete analog components. The
increase in processing capacity afforded by digital signal processors
has enabled more types of devices and more functions for prior
devices. This process has created the appetite for more complex
functions and features on current devices and new types of devices.
In some cases this appetite has outstripped the ability to cost
effectively deliver the desired functionality with fully programmable
digital signal processors.

[0003] One response to this need is to couple a digital signal
processor with an application specific integrated circuit (ASIC). The
digital signal processor is programmed to handle control functions
and some signal processing. The full programmability of the digital
signal processor enables product differentiation through different
programming. The ASIC is constructed to provide processing hardware
for certain core functions that are commonly performed and time
critical. With the increasing density of integrated circuits it is
now becoming possible to place a digital signal processor and an ASIC
hardware co-processor on the same chip.
[0004] This approach has two problems. First, it rarely results in
an efficient connection between the hardware co-processor ASIC and
the digital signal processor. It is typical to handle most of the
interface by programming the digital signal processor. In many cases
the digital signal processor must supply data pointers and commands
in real time as the hardware co-processor is operating. To form safe
designs, it is typical to provide extra time for the digital signal
processor to service the hardware co-processor. This means that the
hardware co-processor is not fully used. A second problem comes from
design time. With the increasing capability to design differing
functionality, the product cycles have been reduced. This puts a
premium on designing new functions quickly. The ability to reuse
programs and interfaces would aid in shortening design cycles.
However, the fixed functions implemented in the ASIC hardware
co-processor cannot easily be reused. The typical ASIC hardware
co-processor has a limited set of functions suitable for a narrow
range of problems. These designs cannot be quickly reused even to
implement closely related functions. In addition the interface
between the digital signal processor and the ASIC hardware
co-processor tends to use ad hoc techniques that are specific to a
particular product.

SUMMARY OF THE INVENTION 

[0005] This invention is a data processing system including a
digital signal processor core and a co-processor. The co-processor
has a local memory within the address space of the said digital
signal processor core. The co-processor is responsive to commands
from the digital signal processor core to perform predetermined data
processing operations on data stored in said local memory in parallel
with the digital signal processor core. The data processing system
includes a direct memory access circuit under the control of the
digital signal processor core. The direct memory access circuit
autonomously transfers data to and from the local memory of the
co-processor.

[0006] The co-processor responds to commands to configure itself
correspondingly to perform a set of related data processing
operations. Co-processor commands are stored in a command first in
first out (FIFO) memory. The command FIFO memory has an input mapped
to a predetermined memory address.

[0007] The co-processor is responsive to various con- 
trol commands. A receive data synchronism command 
pauses processing commands until the direct memory 
access circuit signals completion of a memory transfer 
into the local memory. A send data synchronism com- 
mand causes the co-processor to signal the direct mem- 
ory access circuit to trigger a predetermined memory 
transfer out of the local memory. An interrupt command 
causes the co-processor to interrupt the digital signal 
processor core. 
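
The ordering effect of these three control commands can be sketched in a few lines of Python. This is only an illustrative model, not the claimed hardware: the tuple encoding of commands, the names `RECV_SYNC`, `SEND_SYNC`, `INTERRUPT` and `COMPUTE`, and the function `run_coprocessor` are all assumptions introduced here.

```python
def run_coprocessor(commands, dma_in_complete):
    """Step through commands in order and return the events produced.

    commands: list of tuples such as ("RECV_SYNC",), ("COMPUTE", "fir"),
              ("SEND_SYNC",), ("INTERRUPT",)  (hypothetical encoding)
    dma_in_complete: True once the DMA circuit has signalled that an
              inbound transfer into local memory has finished
    """
    events = []
    for command in commands:
        kind = command[0]
        if kind == "RECV_SYNC":
            if not dma_in_complete:
                # Command processing pauses until the DMA signal arrives.
                events.append("stalled")
                break
            dma_in_complete = False  # consume the completion signal
        elif kind == "COMPUTE":
            events.append("compute " + command[1])
        elif kind == "SEND_SYNC":
            # Signal the DMA circuit to move results out of local memory.
            events.append("trigger DMA out")
        elif kind == "INTERRUPT":
            events.append("interrupt DSP")
    return events
```

With the inbound transfer complete, a sequence of receive sync, compute, send sync and interrupt runs straight through; without it, the model stops at the receive sync command and nothing after it executes.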

[0008] Each command includes an indication of a data input location
within the local memory. The co-processor recalls data from local
memory starting with the indicated data input location. Each command
includes an indication of a data output location within the local
memory. The co-processor stores resultant data in local memory
starting with the indicated data output location. The input data may
be stored in a circularly organized memory area serving as an input
buffer. The resultant data may be stored in a circularly organized
memory area serving as an output buffer.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0009] The present invention will now be further de- 
scribed by way of example, with reference to the exem- 
plary embodiments illustrated in the accompanying 
drawings, in which: 

Figure 1 illustrates the combination of a digital signal processor
core and a reconfigurable hardware co-processor in accordance with
this invention;

Figure 2 illustrates the memory map logical coupling between the
digital signal processor core and the reconfigurable hardware
co-processor of this invention;

Figure 3 illustrates a manner of using the reconfigurable hardware
co-processor memory;

Figure 4 illustrates a memory management technique useful for filter
algorithms;

Figure 5 illustrates an alternative embodiment of the combination of
Figure 1 including two co-processors with a private bus between;

Figure 6 illustrates the construction of a hardware co-processor
which is reconfigurable to perform a variety of filter functions;

Figure 7 illustrates the input formatter of the reconfigurable
hardware co-processor illustrated in Figure 6;

Figure 8 illustrates the reconfigurable data path core of the
reconfigurable hardware co-processor illustrated in Figure 6;

Figure 9 illustrates the output formatter of the reconfigurable
hardware co-processor illustrated in Figure 6;

Figure 10 illustrates the data flow connections through the data
path core for performing a real finite impulse response filter;

Figure 11 illustrates the data flow connections through the data
path core for performing a complex finite impulse response filter;

Figure 12 illustrates the data flow connections through the data
path core for performing a coefficient update function; and

Figure 13 illustrates the data flow connections through the data
path core for performing a fast Fourier transform.

DETAILED DESCRIPTION OF PREFERRED 
EMBODIMENTS 

[0010] Figure 1 illustrates circuit 100 including a digital signal
processor core 110 and a reconfigurable hardware co-processor 140. In
accordance with the preferred embodiment of this invention, these
parts are formed in a single integrated circuit. Digital signal
processor core 110 may be of conventional design. In the preferred
embodiment digital signal processor core 110 is adapted to control
direct memory access circuit 120 for autonomous data transfers
independent of digital signal processor core 110. External memory
interface 130 serves to interface the internal data bus 101 and
address bus 103 to their external counterparts, external data bus 131
and external address bus 133, respectively. External memory interface
130 is conventional in construction. Integrated circuit 100 may
optionally include additional conventional features and circuits.
Note particularly that the addition of cache memory to integrated
circuit 100 could substantially improve performance. The parts
illustrated in Figure 1 are not intended to exclude the provision of
other conventional parts. Those conventional parts illustrated in
Figure 1 are merely the parts most affected by the addition of
reconfigurable hardware co-processor 140.

[0011] Reconfigurable hardware co-processor 140 is coupled to other
parts of integrated circuit 100 via data bus 101 and address bus 103.
Reconfigurable hardware co-processor 140 includes command memory 141,
co-processor logic core 143, data memory 145 and coefficient memory
147. Command memory 141 serves as the conduit by which digital signal
processor core 110 controls the operations of reconfigurable hardware
co-processor 140. This feature will be further illustrated in Figure
2. Co-processor logic core 143 is responsive to commands stored in
command memory 141 to perform co-processing functions. These
co-processing functions involve exchange of data between co-processor
logic core 143 and data memory 145 and coefficient memory 147. Data
memory 145 stores the input data processed by reconfigurable hardware
co-processor 140 and further stores the resultant of the operations
of reconfigurable hardware co-processor 140. The manner of storing
this data will be further described below with respect to Figure 2.
Coefficient memory 147 stores the unchanging or relatively unchanging
process parameters called coefficients used by co-processor logic
core 143. Though data memory 145 and coefficient memory 147 have been
illustrated as separate parts, it would be easy to employ these
merely as different portions of a single, unified memory. As will be
shown below, for the multiple multiply accumulate co-processor
described below it is best if such a single unified memory has two
read ports for data and coefficients and two write ports for writing
the output data. It is believed best that the memory accessible by
reconfigurable hardware co-processor 140 be located on the same
integrated circuit in physical proximity to co-processor logic core
143. This physical closeness is needed to accommodate the wide memory
buses required by the desired data throughput of co-processor logic
core 143.
[0012] Figure 2 illustrates the memory mapped interface between
digital signal processor core 110 and reconfigurable hardware
co-processor 140. Digital signal processor core 110 controls
reconfigurable hardware co-processor 140 via command memory 141. In
the preferred embodiment command memory 141 is a first-in-first-out
(FIFO) memory. The write port of command memory 141 is memory mapped
into a single memory location within the address space of digital
signal processor core 110. Thus digital signal processor core 110
controls reconfigurable hardware co-processor 140 by writing commands
to the address serving as the input to command memory 141. Command
memory 141 preferably includes two circularly oriented pointers. The
write pointer 151 points to the location within command memory 141
where the next received command is to be stored. Each time there is a
write to the predetermined address of command memory 141, write
pointer 151 selects the physical location receiving the data.
Following such a data write, write pointer 151 is updated to point to
the next physical location within command memory 141. Write pointer
151 is circularly oriented in that it wraps around from the last
physical location to the first physical location. Reconfigurable
hardware co-processor 140 reads commands from command memory 141 in
the same order as they are received (FIFO) using read pointer 153.
Read pointer 153 points to the physical location within command
memory 141 storing the next command to be read. Read pointer 153 is
updated to reference the next physical location within command memory
141 following each such read. Note that read pointer 153 is also
circularly oriented and wraps around from the last physical location
to the first physical location. Command memory 141 includes a feature
preventing write pointer 151 from passing read pointer 153. This may
take place, for example, by refusing to write and sending a memory
fault signal back to digital signal processor core 110 when write
pointer 151 and read pointer 153 reference the same physical
location. Thus the FIFO buffer of command memory 141 can be full and
not accept additional commands.
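
A minimal Python sketch of the two circular pointers may help fix the behaviour described for command memory 141. It is a software model under stated assumptions, not the circuit itself: the class name `CommandFifo`, the `MemoryError` standing in for the memory fault signal, and the occupancy counter `count` (used here to tell a full buffer from an empty one, since equal pointers alone are ambiguous) are all introduced for illustration.

```python
class CommandFifo:
    """Circular command FIFO with a write pointer and a read pointer."""

    def __init__(self, depth):
        self.slots = [None] * depth
        self.write_ptr = 0  # location that receives the next command
        self.read_ptr = 0   # location holding the next command to read
        self.count = 0      # assumption: occupancy counter, not in the text

    def write(self, command):
        """Model a DSP write to the FIFO's single memory-mapped address."""
        if self.count == len(self.slots):
            # Refuse the write; stands in for the memory fault signal.
            raise MemoryError("command FIFO full")
        self.slots[self.write_ptr] = command
        # Wrap from the last physical location back to the first.
        self.write_ptr = (self.write_ptr + 1) % len(self.slots)
        self.count += 1

    def read(self):
        """Return the oldest command, or None if the FIFO is empty."""
        if self.count == 0:
            return None
        command = self.slots[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % len(self.slots)
        self.count -= 1
        return command
```

In this model the DSP side only ever calls `write`, which mirrors the single predetermined address, while the co-processor side only calls `read`, so commands come out in the order they were written.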
[0013] Data memory 145 and coefficient memory 147 are both mapped
within the data address space of digital signal processor core 110.
As illustrated in Figure 2, data bus 101 is bidirectionally coupled
to memory 149. In accordance with the alternative embodiment noted
above, both data memory 145 and coefficient memory 147 are formed as
a part of memory 149. Memory 149 is also accessible by co-processor
logic core 143 (not illustrated in Figure 2). Figure 2 illustrates
three circumscribed areas of memory within memory 149. As will be
further described below, reconfigurable hardware co-processor 140
preferably performs several functions employing differing memory
areas. Note that due to the
[0014] Integrated circuit 100 operates as follows. Digital signal
processor core 110 controls the data and coefficients used by
reconfigurable hardware co-processor 140 by loading the data into
data memory 145 and the coefficients into coefficient memory 147.
Alternatively, digital signal processor core 110 loads the data and
coefficients into the unified memory 149. Digital signal processor
core 110 may be programmed to perform this data transfer directly.
Digital signal processor core 110 may alternatively be programmed to
control direct memory access circuit 120 to perform this data
transfer. Particularly for audio or video processing applications,
the data stream is received at a predictable rate and from a
predictable input device. Thus it would typically be efficient for
digital signal processor core 110 to control direct memory access
circuit 120 to make transfers from external memory to memory
accessible by reconfigurable hardware co-processor 140.
[0015] Following the transfer of data to be processed, digital
signal processor core 110 signals reconfigurable hardware
co-processor 140 with the command for the desired signal processing
algorithm. As previously stated, commands are sent to reconfigurable
hardware co-processor 140 by a memory write to a predetermined
address. Received commands are stored in command memory 141 on a
first-in-first-out basis.
[0016] Each computational command of reconfigurable hardware
co-processor 140 preferably includes a manner to specify the
particular function to be performed. In the preferred embodiment,
reconfigurable hardware co-processor 140 is constructed to be
reconfigurable. Reconfigurable hardware co-processor 140 has a set of
functional units, such as multipliers and adders, that can be
connected together in differing ways to perform different but related
functions. The set of related functions selected for each
reconfigurable hardware co-processor will be based upon a similarity
of the mathematics of the functions. This similarity in mathematics
enables similar hardware to be reconfigured for the plural functions.
The command may indicate the particular computation via an opcode in
the manner of data processor instructions.

[0017] Each computational command includes a manner of specifying
the location of the data to be used by the computation. There are
many suitable methods of designating the data space. For example, the
command may specify a starting address and number of data words or
samples within the block. The data size may be specified as a
parameter or it may be specified by the opcode defining the
computation type. As a further example, the command may specify the
data size, the starting address and the ending address of the input
data. Note that known indirect methods of specifying where the input
data is stored may be used. The command may include a pointer to a
register or a memory location storing any of these parameters such as
start address, data size, number of samples within the data block and
end address.
[0018] Each computational command must further indicate the memory
address range for storing the output data of the particular command.
This indication may be made by any of the methods listed above with
regard to the locations storing the input data. In many cases the
computational function will be a filter function and the amount of
output data following processing will be about equivalent to the
amount of input data. In other cases the amount of output data may be
more or less than the amount of input data. In any event, the amount
of resultant data is known from the amount of input data and the type
of computational function requested. Thus merely specifying the
starting address provides sufficient information to indicate where
all the resultant data is to be stored. It is feasible to store the
output data in a destructive manner, over-writing input data during
processing. Alternatively, the output data may be written to a
different portion of memory and the input data preserved at least
temporarily. The selection between these alternatives may depend upon
whether the input data will be reused.

[0019] Figure 3 illustrates one useful technique involving
alternately employing two memory areas. One memory area 144 stores
the input data needed for the co-processor function. The relatively
constant coefficients are stored in coefficient memory 147. This data
is recalled for use by co-processor logic core 143 (1 read). The
output data is written into the second memory area 146 (1 write).
Following use of the data memory area 144, direct memory access
circuit 120 writes the data for the next block, overwriting the data
previously used (2 write). At the same time, direct memory access
circuit 120 reads data from memory area 146 ahead of it being
overwritten by reconfigurable hardware co-processor 140 (2 read).
These two memory areas for input data and for resultant data could be
configured as circular buffers. In a product that requires plural
related functions, separate memory areas defined as circular buffers
can be employed. One memory area configured as a circular buffer will
be allocated to each separate function.

[0020] The format of computational commands preferably closely
resembles the format of a subroutine call instruction in a high level
language. That is, the command includes a command name similar in
function to the subroutine name specifying the particular
computational function to be performed. Each command also includes a
set of parameters specifying options available within the command
type. These parameters may take the form of direct quantities or
variables, which are pointers to registers or memory locations
storing the desired quantities. The number and type of these
parameters depend upon the command type. This subroutine call format
is important in reusing programs written for digital signal processor
core 110. Upon use the programmer or the compiler provides a stub
subroutine to activate reconfigurable hardware co-processor 140. This
stub subroutine merely receives the subroutine parameters and forms
the corresponding co-processor command using these parameters. The
stub subroutine then writes this command to the predetermined memory
address reserved for command transfer to reconfigurable hardware
co-processor 140 and then returns. This invention envisions that the
computational capacity of digital signal processor cores will
increase regularly with time. Thus the processing requirements of a
particular product may require the combination of digital signal
processor core 110 and reconfigurable hardware co-processor 140 at
one point in time. At a later point in time, the available
computational capacity of an instruction set compatible digital
signal processor core may increase so that the functions previously
requiring a reconfigurable hardware co-processor may be performed in
software by the digital signal processor core. The prior program code
for the product may be easily converted to the new, more powerful
digital signal processor. This is achieved by providing independent
subroutines for each of the commands supported by the replaced
reconfigurable hardware co-processor. Then each place where the
original program employs the subroutine stub to transmit a command to
the reconfigurable hardware co-processor is replaced by the
corresponding subroutine call. Extensive reprogramming is thus
avoided.
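
The stub subroutine idea can be illustrated with a short Python sketch. Everything concrete in it is an assumption made for illustration: the address `COMMAND_FIFO_ADDRESS`, the opcode value `OPCODE_FIR`, the four-field 16-bit command layout, and the helper names are hypothetical and are not taken from the text. The point it tries to show is that the calling program sees the same subroutine signature whether the work goes to the co-processor or is done in software.

```python
import struct

COMMAND_FIFO_ADDRESS = 0x8000  # hypothetical memory-mapped FIFO address
OPCODE_FIR = 1                 # hypothetical opcode for an FIR filter


def encode_command(opcode, in_addr, out_addr, n_samples):
    """Pack a command as four unsigned 16-bit fields (assumed layout)."""
    return struct.pack("<4H", opcode, in_addr, out_addr, n_samples)


def make_fir_stub(bus_write):
    """Return a stub that forwards its parameters to the co-processor."""
    def fir(in_addr, out_addr, n_samples):
        command = encode_command(OPCODE_FIR, in_addr, out_addr, n_samples)
        bus_write(COMMAND_FIFO_ADDRESS, command)  # write, then return
    return fir


def make_fir_software(memory, coeffs):
    """Return a software routine with the same signature as the stub."""
    def fir(in_addr, out_addr, n_samples):
        for n in range(n_samples):
            acc = 0
            for k, c in enumerate(coeffs):
                if n - k >= 0:  # samples before the block count as zero
                    acc += c * memory[in_addr + n - k]
            memory[out_addr + n] = acc
    return fir
```

A program that calls `fir(0, 8, 4)` does not change when `make_fir_stub` is swapped for `make_fir_software`, which is the reuse property the paragraph above describes.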

[0021] Following completion of processing on one block of data, the
data may be transferred out of data memory 145 or unified memory 149.
This second transfer can take place either by direct action of
digital signal processor core 110 reading the data stored at the
output memory locations or through the aid of direct memory access
circuit 120. This output data may represent the output of the
process. In this event, the data is transferred to a utilization
device. Alternatively, the output data of reconfigurable hardware
co-processor 140 may represent work in progress. In this case, the
data will typically be temporarily stored in memory external to
integrated circuit 100 for later retrieval and further processing.

[0022] Reconfigurable hardware co-processor 140 is then ready for
further use. This further use may be additional processing of the
same function. In this case, the process described above is repeated
on a new block of data in the same way. This further use may be
processing of another function. In this case, the new data must be
loaded into memory accessible by reconfigurable hardware co-processor
140, the new command loaded and then the processed data read for
output or further processing.

[0023] Reconfigurable hardware co-processor 140 preferably will be
able to perform more than one function of the product algorithm. Many
digital signal processing tasks will use plural instances of similar
functions. For example, the process may include many similar filter
functions. Reconfigurable hardware co-processor 140 preferably has
sufficient processing capability to perform all these filter
functions in real time. The advantage of operating on blocks of data
rather than discrete samples will be evident when reconfigurable
hardware co-processor 140 operates in such a system. As an example,
suppose that reconfigurable hardware co-processor 140 performs three
functions, A, B and C. These functions may be sequential or they may
be interleaved with functions performed by digital signal processor
core 110. Reconfigurable hardware co-processor 140 first performs
function A on a block of data. This function is performed as outlined
above. Digital signal processor core 110, either directly or by
control of direct memory access circuit 120, loads the data into
memory area 155 of memory 149. Upon issue of the command for
configuration for function A, which specifies the amount of data to
be processed, reconfigurable hardware co-processor 140 performs
function A and stores the resultant data in the portion of memory
area 155 specified by the command. A similar process occurs to cause
reconfigurable hardware co-processor 140 to perform function B on
data stored in memory area 157 and return the result to memory area
157. The performance of function A may take place upon data blocks
having a size unrelated to the size of the data blocks for function
B. Finally, reconfigurable hardware co-processor 140 is commanded to
perform function C on data within memory area 159, returning the
resultant to memory area 159. The block size for performing function
C is independent of the block sizes selected for functions A and B.
[0024] The usefulness of the block processing is seen 
from this example. The three functions A, B and C will 
typically have data flow rates that are independent, that 
is not necessarily equal. Provision of special hardware 
for each function will sacrifice the generality of function- 
ality and reusability of reconfigurable hardware. Further, 
it would be difficult to match the resources granted to 
each function in hardware to provide a balance and the 
best utilization of the hardware. When reconfigurable 
hardware is used there is inevitably an overhead cost 
for switching between configurations. Operating on a 
sample by sample basis for flow through the three func- 
tions would require a maximum number of such recon- 
figuration switches. This would clearly be less than op- 
timal. Thus operating each function on a block of data 
before reconfiguration to switch between functions 
would reduce this overhead. Additionally, it would then 
be relatively easy to allocate resources between the 
functions by selecting the amount of time devoted to 
each function. Lastly, such block processing would gen- 
erally require less control overhead from the digital sig- 
nal processor core than switching between functions at 
a sample level. 
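
The saving in configuration switches can be made concrete with a small calculation. The sketch below is only a counting model under stated assumptions: the function name `reconfiguration_count` is hypothetical, the functions are taken to run in round-robin order, every run of a function is assumed to cost one reconfiguration, and all functions are assumed to share one block size (the text allows them to differ).

```python
def reconfiguration_count(n_samples, block_size, n_functions=3):
    """Count configuration switches for n_functions run in round-robin
    order, each processing block_size samples per configuration."""
    n_blocks = -(-n_samples // block_size)  # ceiling division
    return n_blocks * n_functions


# Sample-by-sample flow through A, B and C over 1024 samples.
per_sample = reconfiguration_count(1024, 1)
# The same 1024 samples processed in blocks of 256.
per_block = reconfiguration_count(1024, 256)
```

Under these assumptions the per-sample schedule needs 3072 switches and the 256-sample block schedule needs 12, so the switching overhead falls in proportion to the block size.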

[0025] The block sizes selected for the various functions A, B and C
will depend upon the relative data rates required and the data sizes.
In addition, the tasks assigned to digital signal processor core 110
and their respective computational requirements must also be
considered. Ideally, both digital signal processor core 110 and
reconfigurable hardware co-processor 140 would be nearly fully
loaded. This would result in optimum use of the resources. Such
balanced loading of digital signal processor core 110 and
reconfigurable hardware co-processor 140 may only be achieved with
product algorithms that can use reconfigurable hardware co-processor
140 for about 50% of the computations. For the case in which
reconfigurable hardware co-processor 140 can perform more than half
of the minimum required computations, additional features implemented
on digital signal processor core 110 can be added to the product to
match the loading. This would result in use of spare computational
resources in digital signal processor core 110. The loading of
computational processes may be statically determined. Such static
computational allocation can best be made when both digital signal
processor core 110 and reconfigurable hardware co-processor 140
perform fixed and known functions. If the computational load is
expected to change with time, then it will probably be best to
dynamically allocate computational resources between digital signal
processor core 110 and reconfigurable hardware co-processor 140 at
run time. It is anticipated that the processes performed by
reconfigurable hardware co-processor 140 will remain relatively
stable and only the processes performed by digital signal processor
core 110 would vary.
[0026] Figure 4 shows a memory management technique that enables
better interruption of operations. Data 400 consisting of data blocks
401, 402 and 403 passes the window 410 of a finite impulse response
filter. Such filters operate on a time history of data. Three
processes A, B and C operate in respective circular buffers 421, 431
and 441 within data memory 145. Such a circular buffer enables the
history to be preserved. Thus when processing the next block
following other processing, the history data is available at
predictable addresses for use. This history data is just before the
newly written data for the next block.

[0027] This technique works well except if memory space needs to be
cleared to permit another task. In that event, the history data could
be flushed and reloaded upon resumption of the filter processing.
Alternatively, the history data needed for the next block could be
moved to another area of memory 145 or to an external memory attached
to external memory interface 130. Either of these methods is
disadvantageous because they require time to move data. This either
delays servicing the interrupt or resuming the original task.
[0028] A preferred alternative is illustrated schematically
in Figure 4. During the writing of the resultant data
to its place in memory, the current sample is written to
a smaller area of memory. For example, input data from
circular buffer 421 is written into history buffer 423, input
data from circular buffer 431 is written into history buffer
433, and input data from circular buffer 441 is written
into history buffer 443. Each of the history buffers 423,
433 and 443 is just the size needed to store the history
according to the width of the corresponding filter window
such as filter window 410. Upon completion of processing
of a block of data, the most recent history is stored
in this restricted area. If the co-processor must be
interrupted, the data within the circular buffers 421, 431 and
441 may be cleared without erasing the history data
stored in history buffers 423, 433 and 443. This technique
spares the need for reloading the data or storing
the data elsewhere prior to beginning the interrupt task.
In many filter tasks enough write memory bandwidth will
be available to achieve writing to the history buffers
without requiring extra cycles. Another advantage of this
technique is that less memory need be allocated to
circular buffers 421, 431 and 441 than previously. In the
previous technique, the circular buffers must be large
enough to include an entire block of data and an
additional amount equal to the required history data. The
technique illustrated in Figure 4 enables the size of the
circular buffers 421, 431 and 441 to be reduced to just
enough to store one block of data.
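
The separate history buffer described above can be sketched in software as follows. This is only an illustrative model, not the hardware of Figure 4: the class name BlockFilterState, its methods and the use of a Python deque are assumptions made for the example. It shows that the block buffer may be cleared between blocks without disturbing the filter output, because the samples needed by the next block live in the small history buffer.

```python
from collections import deque


class BlockFilterState:
    """Working block buffer plus a small, separate history buffer (hypothetical model)."""

    def __init__(self, taps, block_size):
        self.taps = list(taps)
        self.block = [0] * block_size          # stands in for a circular buffer such as 421
        # The history holds only the len(taps) - 1 samples the next block needs.
        self.history = deque([0] * (len(taps) - 1), maxlen=len(taps) - 1)

    def process(self, samples):
        """Filter one block; every input sample is also copied into the history."""
        self.block[: len(samples)] = samples
        out = []
        for x in samples:
            window = list(self.history) + [x]  # oldest ... newest
            out.append(sum(c * s for c, s in zip(self.taps, reversed(window))))
            self.history.append(x)             # extra write alongside the result write
        return out

    def clear_block(self):
        """Model an interrupting task taking over the block buffer."""
        self.block = [0] * len(self.block)
```

In this sketch the history buffer has len(taps) - 1 entries regardless of the block size, which mirrors the statement that each history buffer is just the size needed for the filter window.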
[0029] Many algorithms useful in audio and video signal
processing involve adapting coefficients. That is,
there is some feedback path that changes the function
performed over time. An example of such an algorithm
is a modem that requires a time to adapt to the particular
line employed and the operation of the far end modem.
Initially it would seem that performing such adaptive
functions in block mode would adversely affect the
convergence of these adaptive functions. Review of the
mathematics involved in many such functions shows 
otherwise. The amount of adaption that can be per- 
formed at a particular time generally depends upon the 
amount of data available for computing the adaption. 
This amount of available data does not depend upon 
whether the data is processed sample by sample or in 
blocks of samples. In practice the rate of adaption will 
be about the same. Adaption on a sample by sample 
basis would result in convergence toward the fully 
adapted coefficients in many small steps. Adaption 
based upon blocks of data would result in convergence 
in fewer and larger steps. This is because the greater 
amount of data available would drive a larger error term 
for correction in the block processing case. However, 
the average convergence slope would be the same for 
the two cases. In cases where most of the adaption
takes place upon initialization and most of the processing
takes place under steady state conditions, such as
the previous modem example, there would be little prac- 
tical difference. In cases where the adaptive filter must 
follow a moving target, it is not clear whether adaption 
on a sample by sample basis is better than adaption on
a block basis. If, for example, the process followed varies
at a frequency greater than the inverse of the time
of the block size, then adaption on a block basis may 
prevent useless hunting in small steps as compared with 
sample by sample adaption. Thus adaptive filtering on 
a block basis has no general disadvantage over adap- 
tive filtering on a sample by sample basis. 
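
The comparison between sample by sample and block adaption can be tried numerically. The following is a minimal sketch that assumes a least mean squares (LMS) update, which the text above does not name; the functions make_signal and adapt, the step size and the block size are all hypothetical choices for illustration. With block=1 the coefficients move after every sample; with a larger block the error terms are accumulated and applied in one larger step.

```python
import random


def make_signal(n, true_w, seed=0):
    """Generate an input sequence and the output of an unknown FIR system."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    taps = len(true_w)
    d = [sum(true_w[k] * x[i - k] for k in range(taps) if i - k >= 0) for i in range(n)]
    return x, d


def adapt(x, d, taps, mu, block):
    """LMS adaption; block=1 is sample by sample, block>1 updates once per block."""
    w = [0.0] * taps
    for start in range(0, len(x), block):
        grad = [0.0] * taps
        for i in range(start, min(start + block, len(x))):
            window = [x[i - k] if i - k >= 0 else 0.0 for k in range(taps)]
            err = d[i] - sum(wk * xk for wk, xk in zip(w, window))
            for k in range(taps):
                grad[k] += err * window[k]   # error terms accumulate over the block
        for k in range(taps):
            w[k] += mu * grad[k]             # one larger step per block
    return w
```

Under these assumptions both variants settle on the same coefficients for a stationary system; the sketch says nothing about the moving target case discussed above.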
[0030] The command set of reconfigurable hardware 
co-processor 140 preferably includes several non-com- 
putational instructions for control functions. These con- 
trol functions will be useful in cooperation between dig- 
ital signal processor core 110 and reconfigurable hard- 
ware co-processor 140. The first of these non-compu- 
tational commands is a receive data synchronization 
command. This command will typically be used in con- 
junction with data transfers handled by direct memory 
access circuit 120. Digital signal processor core 110 will 
control the process by setting up the input data transfer 
through direct memory access circuit 120. Digital signal
processor core 110 will send two commands to
reconfigurable hardware co-processor 140. The first
command is the receive data synchronization command.
The second command is the computational command 
desired. 

[0031] Reconfigurable hardware co-processor 140
operates on commands stored in command memory
141 on a first-in-first-out basis. Upon reaching the
receive data synchronization command, reconfigurable
hardware co-processor will stop. Reconfigurable
hardware co-processor will remain idle until it receives a
control signal from direct memory access circuit 120
indicating completion of the input data transfer. Note that
upon such completion of this input data transfer, the data
for the next block is stored in data memory 145 or unified
memory 149. Direct memory access circuit 120 may be
able to handle plural queued data transfers. This is
known in the art as plural DMA channels. In this case
the receive data synchronization command must note
the corresponding DMA channel, which would be known
to digital signal processor core 110 before transmission
of the receive data synchronization command. Direct
memory access circuit 120 would transmit the channel
number of each completed data transfer. This would
permit reconfigurable hardware co-processor 140 to
match the completed direct memory access with the
corresponding receive data synchronization command.
Reconfigurable hardware co-processor would continue
to the next command only if a completed direct memory
access signal indicated the same DMA channel as
specified in the receive data synchronization command.
[0032] Following this completion signal, reconfigurable
hardware co-processor 140 advances to the next
command in command memory 141. In this case this
next command is a computational command using the
data just loaded. Since this computational command
cannot start until the previous receive data
synchronization command completes, this assures that the
correct data has been loaded.
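
The stalling behaviour of the receive data synchronization command can be modelled in a few lines. This is a toy software model under stated assumptions, not the circuit itself: the class CoProcessor, the command tuples ("recv_sync", channel) and ("compute", name), and the method dma_complete are hypothetical names chosen for the example. It shows a queued computation being held back until the DMA channel named in the synchronization command reports completion.

```python
from collections import deque


class CoProcessor:
    """Toy model of a command FIFO with a receive data synchronization command."""

    def __init__(self):
        self.fifo = deque()          # stands in for command memory 141
        self.done_channels = set()   # DMA channels that have signalled completion
        self.log = []                # computational commands actually executed

    def push(self, command):
        self.fifo.append(command)

    def dma_complete(self, channel):
        """Called when the modelled DMA circuit reports a finished transfer."""
        self.done_channels.add(channel)
        self.run()

    def run(self):
        """Execute queued commands in order until the FIFO empties or a sync stalls."""
        while self.fifo:
            kind, arg = self.fifo[0]
            if kind == "recv_sync":
                if arg not in self.done_channels:
                    return                       # stall: matching channel not done yet
                self.done_channels.discard(arg)  # consume the completion signal
            else:
                self.log.append(arg)             # "compute" commands are just logged
            self.fifo.popleft()
```

In this model a completion on a different channel leaves the co-processor stalled, which corresponds to the channel matching described in paragraph [0031].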

[0033] This combination of the receive data
synchronization command and the computational command
reduces the control burden on digital signal processor
core 110. Digital signal processor core 110 need only
set up direct memory access circuit 120 to make the
input data transfer and send the pair of commands to
reconfigurable hardware co-processor 140. This would
assure that the input data transfer had completed prior
to beginning the computational operation. This greatly
reduces the amount of software overhead required by
the digital signal processor core 110 to control the
function of reconfigurable hardware co-processor 140.
Otherwise, digital signal processor core may need to
receive an interrupt from direct memory access circuit 120
signaling the completion of the input data load operation.
An interrupt service routine must be written to service
the interrupt. In addition, such an interrupt would
require a context switch to send the co-processor
command to command memory and another context switch
to return from the interrupt. Consequently, the receive
data synchronization command frees considerable
capacity within digital signal processor core for more
productive use.

[0034] Another non-computational command is a
send data synchronization command. The send data
synchronization command is nearly the inverse of the
receive data synchronization command. Upon reaching
the send data synchronization command, reconfigurable
hardware co-processor 140 triggers a direct memory
access operation. This direct memory access operation
reads data from data memory 145 or unified memory
149 for storage at another system location. This direct
memory access operation may be preset by digital signal
processor core 110 and is merely begun upon receipt
of a signal from reconfigurable hardware co-processor
140 upon encountering the send data synchronization
command. In the case in which direct memory
access circuit 120 supports plural DMA channels the
send data synchronization command must specify the
DMA channel triggered. Alternatively, the send data
synchronization command may specify the control
parameters for direct memory access circuit 120, including
the DMA channel if more than one channel is supported.
Upon encountering such a send data synchronization
command, reconfigurable hardware co-processor
communicates directly with direct memory access circuit 120
to set up and start an appropriate direct memory access
operation.

[0035] Another possible non-computational com- 
mand is a synchronization completion command. Upon 
encountering a synchronization completion command, 
reconfigurable hardware co-processor 140 sends an in- 
terrupt to digital signal processor core 110. Upon receiv- 
ing such an interrupt, digital signal processor core 110 
is assured that all prior commands sent to reconfigura- 
ble hardware co-processor 140 have completed. De- 
pending upon the application, it may be better to control 
via interrupts than through send and receive data syn- 
chronization commands. It may also be better to queue 
several operations for reconfigurable hardware co-proc- 
essor 140 using send and receive data synchronization 
commands and then interrupt digital signal processor 
core 110 at the end of the queue. This may be useful for
higher level control functions by digital signal processor 
core 110 following the queued operations by reconfig- 
urable hardware co-processor. 
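
The queued use of send data synchronization commands followed by a single synchronization completion command might look as follows in a simplified model. The function run_queue and the command tuples are hypothetical; real commands would be words written to the command FIFO. The sketch only illustrates the ordering guarantee: the interrupt is raised after every earlier command in the queue has been handled.

```python
def run_queue(commands):
    """Drain a command queue in order and return the signals the co-processor emits.

    Hypothetical command tuples:
      ("compute", name)      -> perform a computation
      ("send_sync", channel) -> trigger a preset DMA transfer out of local memory
      ("sync_done", None)    -> interrupt the DSP core
    """
    completed = []   # computations finished so far
    signals = []     # (target, payload) pairs emitted toward the DMA or the DSP core
    for kind, arg in commands:
        if kind == "compute":
            completed.append(arg)
        elif kind == "send_sync":
            signals.append(("dma_start", arg))
        elif kind == "sync_done":
            # Everything queued earlier has finished by the time this is reached.
            signals.append(("dsp_interrupt", tuple(completed)))
        else:
            raise ValueError(f"unknown command: {kind}")
    return signals
```

With this arrangement the digital signal processor core is interrupted once per queue rather than once per block.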

[0036] Figure 5 illustrates another possible arrangement
of circuit 100. Circuit 100 illustrated in Figure 5
includes two reconfigurable hardware co-processors.
Digital signal processor core 110 operates with first
reconfigurable hardware co-processor 140 and second
reconfigurable hardware co-processor 180. A private
bus 185 couples first reconfigurable hardware
co-processor 140 and second reconfigurable hardware
co-processor 180. These co-processors have private
memories sharing the memory space of digital signal
processor core 110. The data can be transferred via private
bus 185 by one co-processor writing to the address
range encompassed by the other co-processor's private
memory. Alternatively, each co-processor may have an
output port directed toward an input port of the other
co-processor with the links between co-processors
encompassed in private bus 185. This construction may be
particularly useful for products in which data flows from one
type of operation handled by one co-processor to another
type of operation handled by the second co-processor.
This private bus frees digital signal processor core 110
from having to handle the data handoff either directly or
via direct memory access circuit 120.
[0037] Figures 6 to 9 illustrate the construction of an
exemplary reconfigurable hardware co-processor. This
particular co-processor is called a multiple
multiply-accumulator. The multiply-accumulate operation, where
the sum of plural products is formed, is widely used in
signal processing. Many filter algorithms are built
around these functions.

[0038] Figure 6 illustrates the overall general
architecture of multiple multiply-accumulator 140. Data
memory 145 and coefficient memory 147 may be written to
in 128 bit words. This write operation is controlled by
digital signal processor core 110 or direct memory
access circuit 120. Address generator 150 generates the
addresses for recall of data and coefficients used by the
co-processor. This read operation operates on data
words of 128 bits from each memory.
[0039] These recalled data words are supplied to
input formatter 160. Input formatter 160 performs various
shift and alignment operations generally to arrange the
128 bit input data words into the order needed for the
desired computation. Input formatter outputs a 128 bit
(8 by 16 bits) Data X, a 128 bit (8 by 16 bits) Data Y and
a 64 bit (2 by 32 bits) Data Z.

[0040] These three data streams are supplied to
datapath 170. Datapath 170 is the operational portion of
the co-processor. As will be further described below,
datapath 170 includes plural hardware multipliers and
adders that are connectable in various ways to perform
a variety of multiply-accumulate operations. Datapath
170 outputs two adder data streams. Each of these is
four 32 bit data words.
[0041] These two data streams supply the inputs to
output formatter 180. Output formatter 180 rearranges
the two data streams into two 128 bit data words for
writing back into the two memories. The addresses for these
write operations are computed by address generator
150. This rearrangement may take care of alignment on
memory word boundaries.

[0042] The operations of co-processor 140 are under
control of control unit 190. Control unit 190 recalls the
commands from command memory 141 and provides
the corresponding control within co-processor 140.
[0043] The construction of input formatter 160 is
illustrated in Figure 7. Each of the two data streams of 128
bits is supplied to an input of multiplexers 205 and 207.
Each multiplexer independently selects one input for
storage in its corresponding register 215 and 217.
Multiplexer 205 may select to recycle the contents of
register 215 as well as either data stream. Multiplexer 207
may only select one of the input data streams.
Multiplexers 201 and 203 may select the contents of register 215
or may select recycling of the contents of their
respective registers 211 and 213. Multiplexer 129 selects the
contents of either register 211 or 213 for supply to the
upper bits of shifter 221. The lower bits are supplied from
register 215. Shifter 221 shifts and selects only 128 bits
of its 256 input bits. These 128 bits are supplied to
duplicate/swap unit 223. Duplicate/swap unit 223 may
duplicate a portion of its input into the full 128 bits or it
may rearrange the data order. Thus sorted, the data is
temporarily stored in register 225. This forms the Data
X input to datapath 170. The output of multiplexer 207
is supplied directly to multiplexer 233 as well as
supplied via register 217. Multiplexer 233 selects 192 bits
from the bits supplied to it. The upper 128 bits form the
Data Y input to datapath 170. These bits may be
recirculated via multiplexer 235. The lower 64 bits form the
Data Z input to datapath 170.

[0044] Figure 8 illustrates in block diagram form the
construction of datapath 170. Various segments of the
Data X and the Data Y inputs supplied from input
formatter are supplied to dual multiply adders 310, 320,
330 and 340. As shown, the first and second 16 bit data
words Data X[0:1] and Data Y[0:1] are coupled to dual
multiply adder 310, the third and fourth 16 bit data words
Data X[2:3] and Data Y[2:3] are coupled to dual multiply
adder 320, the fifth and sixth 16 bit data words Data
X[4:5] and Data Y[4:5] are coupled to dual multiply adder
330 and the seventh and eighth 16 bit data words Data
X[6:7] and Data Y[6:7] are coupled to dual multiply
adder 340. Because each of these units is identical, only
dual multiply adder 310 will be described in detail. The least
significant 16 Data X and Data Y bits supply inputs to
multiplier 311. Multiplier 311 receives the pair of 16 bit
inputs and produces a 32 bit product. This product is
stored in a pair of pipeline output registers. The 32 bit
output is supplied to both sign extend unit 313 and an 8
bit left shifter 314. Sign extend unit 313 repeats the sign
bit of the product, which is the most significant bit, to 40
bits. The 8 bit left shifter 314 left shifts the 32 bit product
and zero fills the vacated least significant bits. One of
these two 40 bit quantities is selected in multiplexer 316
for application to a first input of 40 bit adder 319. In a
similar fashion, the next most significant 16 Data X and
Data Y bits are supplied to respective inputs of multiplier
312. Multiplier 312 receives the two 16 bit inputs and
produces a 32 bit product. The product is stored in a pair
of pipeline registers. The 8 bit right shifter 315 right shifts
the product by 8 bits and zero fills the vacated most
significant bits. Multiplexer 317 selects from among three
quantities. The first quantity is a concatenation of the 16
Data X bits and the 16 Data Y bits at the input. This input
allows multiplier 312 to be bypassed. If selected, the 32
bits (as sign extended by sign extend unit 318) are added
to the product produced by multiplier 311. The second
quantity is the product supplied by multiplier 312. The
third quantity is the shifted output of 8 bit right shifter
315. The selected quantity from multiplexer 317 is sign
extended to 40 bits by sign extend unit 318. The sign
extended 40 bit quantity is the second input to 40 bit
adder 319. Adder 319 is provided with 40 bits, even
though the 16 bit input factors would produce only 32
bits, to provide dynamic range for plural multiply
accumulates.
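
The arithmetic of one dual multiply adder in its plain sum-of-two-products configuration can be sketched as below. This is an assumption-laden software model: the helper to_signed and the function dual_multiply_add are hypothetical names, and the model ignores the pipeline registers, the shifters and the bypass input. It illustrates why a 40 bit adder is useful: two full-scale 16 bit by 16 bit products already exceed the signed 32 bit range.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def dual_multiply_add(x0, y0, x1, y1):
    """Sum of two signed 16 x 16 products, kept in a 40 bit accumulator."""
    p0 = to_signed(x0, 16) * to_signed(y0, 16)   # like multiplier 311, 32 bit product
    p1 = to_signed(x1, 16) * to_signed(y1, 16)   # like multiplier 312
    return to_signed(p0 + p1, 40)                # 40 bit adder with wraparound
```

For example, two products of -32768 by -32768 sum to 2**31, one more than the largest signed 32 bit value, yet the result is still exact in 40 bits.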

[0045] The outputs of the adders 319 within each of the
dual multiply adder units 310, 320, 330 and 340 are
provided as the first adder stage output adder_st1_outp.
Only the 32 most significant adder output bits are
connected to the output. This provides a 4 by 32 bit or 128
bit output.

[0046] A second stage of 40 bit adders includes
adders 353 and 355. Adder 353 adds the outputs of dual
multiply adder units 310 and 320. Adder 355 adds the
outputs of dual multiply adder units 330 and 340. Two
other data paths join within the second adder stage. The
least significant 32 bits of the Data Z input are temporarily
stored in pipeline register 351. This 32 bit quantity is sign
extended to 40 bits in sign extend unit 352. In a similar
fashion, the most significant bits of the Data Z input are
temporarily stored in pipeline register 357. This quantity
is sign extended to 40 bits by sign extend unit 358.
[0047] The third adder stage includes adders 361,
363, 367 and 368. Adder 361 is 40 bits wide. It adds the
output of adder 353 and the sign extended least
significant Data Z bits. The 32 most significant bits of this sum
are supplied as part of the third stage output
adder_st3_outp. Similarly, adder 363 is 40 bits wide and
adds the output of adder 355 and the sign extended
most significant Data Z bits. The 32 most significant bits
of this sum are supplied as part of the third stage output
adder_st3_outp. The connections to adders 367 and
368 are much more complicated. The first input to adder
367 is either the output of adder 353 of the second stage
or a recirculated output as selected by multiplexer 364.
Multiplexer 371 selects from among 8 pipeline registers
for the recirculation quantity. The second input to adder
367 is selected by multiplexer 365. This is either the
least significant Data Z input as sign extended by sign
extend unit 352, the direct output of adder 368, the
output of adder 355 or a fixed rounding quantity rnd_add.
Addition of the fixed rounding quantity rnd_add causes
the adder to round the quantity at the other input. The
output of adder 367 supplies the input to variable right
shifter 375. Variable right shifter 375 right shifts the sum
a selected amount of 0 to 15 bits. The 32 most significant
bits of its output form a part of the third stage output
adder_st3_outp. The first input to adder 368 is the
output of adder 355. The second input to adder 368 is
selected by multiplexer 366. Multiplexer 366 selects either
the output of adder 353, the most significant Data Z input
as sign extended by sign extend unit 358, the
recirculation input or the fixed rounding quantity rnd_add.
Multiplexer 373 selects the recirculation quantity from among
8 pipeline registers at the output of adder 368. The
output of adder 368 supplies the input to variable right
shifter 377. Variable right shifter 377 right shifts the sum a
selected amount of 0 to 15 bits. The 32 most significant
bits of its output form another part of the third stage
output adder_st3_outp.
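
The effect of adding a rounding quantity ahead of a right shift can be shown with a short sketch. The text calls rnd_add a fixed quantity and does not give its value; the code below assumes, for illustration only, that it equals half of the least significant bit retained after the shift. The function name round_and_shift is hypothetical.

```python
def round_and_shift(value, shift):
    """Right shift by `shift` bits, rounding to nearest instead of truncating."""
    if shift == 0:
        return value
    rnd_add = 1 << (shift - 1)      # half of the last bit kept (assumed value)
    return (value + rnd_add) >> shift
```

Without the added quantity a right shift always rounds toward minus infinity; with it, 5 shifted right by 1 gives 3 rather than 2.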

[0048] Figure 9 illustrates the construction of the out- 
put formatter illustrated in Figure 6. 
[0049] Figures 10 to 13 illustrate several ways that
multiple multiply accumulate co-processor 140 may be
configured. The data flow in each of these examples can
be achieved by proper selection of the multiplexers
within datapath 170. The following description will note the
corresponding multiplexer selections when they are
relevant to achieving the desired data flow.
[0050] Figure 10 illustrates the data flow in a real finite
impulse response (FIR) filter. Data D0 to D7 and coefficients
C0 to C7 are supplied to respective multipliers 311, 312,
321, 322, 331, 332, 341 and 342. In this case
multiplexers corresponding to multiplexer 317 in dual multiply
adder unit 310 each select the product of the respective
multipliers 312, 322, 332 and 342. Pairs of products are
summed in adders 319, 329, 339 and 349. Pairs of these
sums are further summed in adders 353 and 355. The
sums formed by adders 353 and 355 are added in adder
368. In this case, multiplexer 366 selects the sum
produced by adder 353 for the second input to adder 368.
Adder 367 does the accumulate operation. Multiplexer
364 selects the output of multiplexer 371, selecting a
pipeline register for recirculation, as the first input to
adder 367. Multiplexer 365 selects the output of adder
368 as the second input to adder 367. Adder 367
produces the filter output. Note that this data flow produces
the sum of the 8 products formed with the prior summed
products. This operation is generally known as multiply
accumulate and is widely used in filter functions.
Configuration of datapath 170 as illustrated in Figure 7
permits computation of the accumulated sum of 8 products.
This greatly increases the throughput in this data flow
over the typical single product accumulation provided
by digital signal processor core 110.
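
The accumulation of 8 products per pass can be mimicked in software. The sketch below is a behavioural illustration only; the function name fir_output_mac8 and the treatment of samples before the start of the data as zero are assumptions. Each loop iteration stands for one pass through the adder tree, whose result is added to the recirculated sum.

```python
def fir_output_mac8(data, coeffs, n):
    """Compute one FIR output y[n], accumulating up to 8 products per pass."""
    acc = 0                                   # recirculated accumulator
    for base in range(0, len(coeffs), 8):     # one pass = one group of up to 8 taps
        products = [
            coeffs[k] * data[n - k]
            for k in range(base, min(base + 8, len(coeffs)))
            if n - k >= 0                     # samples before the start count as zero
        ]
        acc += sum(products)                  # adder tree result added to prior sum
    return acc
```

A 16 tap filter output therefore takes two passes in this model, where a single multiply accumulate unit would take sixteen steps.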
[0051] Figure 11 illustrates the data flow of a complex
FIR filter. This data flow is similar to that of the real FIR
filter illustrated in Figure 7. The data flow of Figure 8
simultaneously operates on the real and imaginary parts
of the computation. Data and coefficients are supplied
to respective multipliers 311, 312, 321, 322, 331, 332,
341 and 342. Multiplexers corresponding to multiplexer
317 in dual multiply adder unit 310 each select the
product of the respective multipliers 312, 322, 332 and 342.
Pairs of products are summed in adders 319, 329, 339
and 349. Pairs of these sums are further summed in
adders 353 and 355. The real and imaginary parts are
separately handled by adders 367 and 368. Multiplexer
365 selects the sum of adder 353 for the second input
to adder 367. Multiplexer 364 selects the output of
multiplexer 371, selecting a pipeline register for
recirculation, as the first input to adder 367. Adder 368 receives
the sum of adder 355 as its first input. Multiplexer 366
selects the recirculation output of multiplexer 373 for the
second input to adder 368. The pair of adders 367 and
368 thus produce the real and imaginary parts of the
multiply accumulate operation.

[0052] Figure 12 illustrates the data flow in a
coefficient update operation. The error terms E0 to E3 are
multiplied by the corresponding weighting terms W0 to
W3 in multipliers 311, 321, 331 and 341. The current
coefficients to be updated C0 to C3 are input directly to
adders 319, 329, 339 and 349 as selected by
multiplexers 317, 327, 337 and 347. The respective products are
added to the current values in adders 319, 329, 339 and
349. In this case the output is produced by adders 319,
329, 339 and 349 via the adder stage 1 output
adder_st1_outp.
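
In software terms the coefficient update amounts to one multiply and one add per lane. The following sketch uses a hypothetical function name, update_coefficients, and plain Python integers in place of the 16 bit and 40 bit hardware quantities.

```python
def update_coefficients(coeffs, errors, weights):
    """Return c[i] + e[i] * w[i] for each of the parallel lanes."""
    if not (len(coeffs) == len(errors) == len(weights)):
        raise ValueError("all lanes need a coefficient, an error and a weight")
    return [c + e * w for c, e, w in zip(coeffs, errors, weights)]
```

Four lanes correspond to the four dual multiply adders, each using one multiplier and the bypass input of its adder.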

[0053] Figure 13 illustrates the data flow in a fast
Fourier transform (FFT) operation. The FFT operation starts
with a 16 bit by 32 bit multiply operation. This is achieved
as follows. Each dual multiply adder 310, 320, 330 and
340 receives a respective 16 bit quantity A0 to A3 at one
input of each of the paired multipliers 311 and 312, 321
and 322, 331 and 332, and 341 and 342. Multipliers 311,
321, 331 and 341 receive the 16 most significant bits of
the 32 bit quantity B0H to B3H. Multipliers 312, 322, 332
and 342 receive the 16 least significant bits of the 32 bit
quantity B0L to B3L. Shifters 314, 315, 324, 325, 334,
335, 344 and 345 are used to align the products.
Multiplexers 316, 326, 336 and 346 select the left shifted
quantity from respective 8 bit left shifters 314, 324, 334
and 344 for the first input into respective adders 319,
329, 339 and 349. Multiplexers 317, 327, 337 and 347
select the right shifted quantity from respective 8 bit right
shifters 315, 325, 335 and 345 as the second inputs to
respective adders 319, 329, 339 and 349. These two
oppositely directed 8 bit shifts provide an effective 16
bit shift for aligning the partial products for a 16 bit by
32 bit multiply. Pairs of these sums are further summed
in adders 353 and 355. Adder 361 adds the Data Z0
input with the output from adder 353. Multiplexer 364
selects the sum of adder 353 as the first input to adder
367. Multiplexer 365 selects the Data Z0 input as the
second input to adder 367. Adder 368 receives the sum
of adder 355 as its first input. Multiplexer 366 selects
the Data Z1 input as the second input to adder 368.
Adder 363 adds the sum of adder 355 and the Data Z1
input. The output of the FFT operation is provided by the
sum outputs of adders 361, 367, 368 and 363.
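
The alignment of the two partial products by opposed 8 bit shifts can be checked with ordinary integer arithmetic. The sketch below makes assumptions the text does not state: the low half of the 32 bit operand is treated as unsigned, and the right shift is an arithmetic (sign preserving) shift, whereas the description of shifter 315 mentions zero fill. The function name mul_16x32_shifted is hypothetical.

```python
def mul_16x32_shifted(a, b):
    """Form (a * b) >> 8 from two 16 x 16 products and opposed 8 bit shifts."""
    b_high = b >> 16            # signed upper 16 bits of the 32 bit operand
    b_low = b & 0xFFFF          # lower 16 bits, treated here as unsigned (assumption)
    high_part = (a * b_high) << 8    # left shifted product
    low_part = (a * b_low) >> 8      # right shifted product (arithmetic shift)
    return high_part + low_part      # the two parts now differ in weight by 2**16
```

Because the high partial product is an exact multiple of 256 after its left shift, the sum equals the full product shifted right by 8 bits under these assumptions.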

[0054] The list below is a partial list of some of the 
commands that may be performed by the data path 170 
of multiple multiply accumulate unit 140 illustrated in 
Figures 3 to 6. 

vector_add_16b(len, pdata, pcoeff, pout)
vector_add_32b(len, pdata, pcoeff, pout)
vector_mpy_16b(len, pdata, pcoeff, pout)
vector_mpy_1632b(len, pdata, pcoeff, pout)
vector_mpy_32b(len, pdata, pcoeff, pout)

scalar_vector_add_16b(len, pdata, pcoeff, pout)
scalar_vector_add_32b(len, pdata, pcoeff, pout)
scalar_vector_mpy_16b(len, pdata, pcoeff, pout)
scalar_vector_mpy_1632b(len, pdata, pcoeff, pout)
scalar_vector_mpy_32b(len, pdata, pcoeff, pout)

For these operations, the operation name indicates the
data size. The "len" parameter field indicates the length
of the function. The "pdata" parameter field is a pointer
to the beginning memory address containing the input
data. The "pcoeff" parameter field is a pointer to the
beginning memory address containing the coefficients for
the filter. The "pout" parameter field is a pointer to the
beginning memory address to receive the output. As
previously described, these pointers preferably point to
respective locations within data memory 145 and
coefficient memory 147 or unified memory 149.

FFT_real(fft_size, pdata, pcoeff, pout)
FFT_complex(fft_size, pdata, pcoeff, pout)

The fast Fourier transform operations preferably all
include 32 bit data and 16 bit coefficients as previously
described in conjunction with Figure 10. The fft_size
parameter field defines the size of the function. The other
listed parameter fields are as described above.

FIR_real(us, ds, len, blocksize, pdata, pcoeff, pout)
FIR_complex_real(us, ds, len, blocksize, pdata, pcoeff, pout)
FIR_complex_real_sum(us, ds, len, blocksize, pdata, pcoeff, pout)
FIR_complex(us, ds, len, blocksize, pdata, pcoeff, pout)

The finite impulse response filter operations differ in the
type of the data and coefficients. The FIR_real operation
employs real data and real coefficients. The
FIR_complex_real operation employs complex data
and real coefficients. The FIR_complex_real_sum
operation separately sums the complex and real parts
employing complex data and real coefficients. The
FIR_complex operation employs both complex data and
complex coefficients. The us parameter field indicates
the upsampling ratio. The ds parameter field indicates
the downsampling ratio. The blocksize parameter field
indicates the size of the operational blocks employed.
The other parameter fields are as previously described.
[0055] The parameters of all these commands could
be either immediate values or, for the data, coefficient
and output locations, 16 bit address pointers into the
co-processor memory. This selection would mean that the
finite impulse response filter commands, which are the
longest, would require about five 16 bit command words. This
would be an insignificant amount of bus traffic.
Alternatively, the parameter fields could be indirect, that is,
identify a register from a limited set of registers for each
parameter. There could be a set of 8 registers for each
parameter, requiring only 3 bits each within the command
word. Since only a limited number of particular filter
settings would be required, this is feasible.
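
The indirect parameter scheme can be illustrated by packing one 3 bit register index per parameter into a command word. The field order in FIR_FIELDS and the functions pack_indirect and unpack_indirect are hypothetical; the text does not define a bit layout or an opcode field. With the seven FIR parameters this sketch uses 21 bits of register selection.

```python
# Hypothetical field order for an indirect FIR command; 3 bits per field.
FIR_FIELDS = ("us", "ds", "len", "blocksize", "pdata", "pcoeff", "pout")


def pack_indirect(register_indexes):
    """Pack one register index (0..7) per parameter into a single integer."""
    word = 0
    for shift, name in enumerate(FIR_FIELDS):
        index = register_indexes[name]
        if not 0 <= index <= 7:
            raise ValueError(f"{name}: register index must fit in 3 bits")
        word |= index << (3 * shift)
    return word


def unpack_indirect(word):
    """Recover the per-parameter register indexes from a packed word."""
    return {name: (word >> (3 * shift)) & 0b111 for shift, name in enumerate(FIR_FIELDS)}
```

The actual parameter values would then be read from whichever register of each 8 entry set the index selects.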


Claims 

1. A data processing system disposed on a single
integrated circuit comprising:

a digital signal processor core connected to a
data bus and an address bus, said digital signal
processing core operable for generating
co-processor commands;

a co-processor connected to said data bus,
said address bus and said digital signal
processing core, said co-processor having a
local memory within the address space of said
digital signal processor core and responsive to
commands generated by said digital signal
processor core to perform predetermined data
processing operations on data stored in said
local memory in parallel to said digital signal
processor core.

2. The data processing system of claim 1, further
comprising:

a direct memory access circuit under the
control of said digital signal processor and capable of
autonomously transferring data between
predefined addresses in memory including transferring
data to and from said local memory of said
co-processor.

3. The data processing system of claim 2, wherein:

said co-processor is responsive for receiving
a data synchronism command for pausing
processing commands until said direct memory access
circuit signals completion of a predetermined memory
transfer of data into said local memory.

4. The data processing system of claim 2, wherein:

said co-processor is responsive to a send
data synchronism command for signalling said direct
memory access circuit to trigger a predetermined
memory transfer of data out of said local memory.

5. The data processing system of any preceding
claim, wherein:

said co-processor further includes a
command first-in-first-out memory having an input
responsive to data written to a predetermined
memory address and an output for controlling operation
of said co-processor.

6. The data processing system of any preceding
claim, wherein said co-processor is responsive to
said commands for configuring itself
correspondingly whereby said co-processor is operable to
perform a set of related data processing operations.

7. The data processing system of any preceding
claim, wherein said co-processor is responsive to
an interrupt command for transmitting an interrupt
signal to said digital signal processor core.

8. The data processing system of any preceding
claim, wherein each command includes an
indication of a data input location within said local
memory; and

said co-processor is responsive to said
commands to recall data from said local memory
starting with said indicated data input location.

9. The data processing system of any preceding
claim, wherein each command includes an
indication of a data output location within said local
memory; and

said co-processor is responsive to said
commands for storing resultant data from a data
processing operation corresponding to said
command in local memory starting with said indicated
data input location.



10. A method of data processing comprising the steps
of:

providing a local memory within a co-processor
having addresses within a memory map of a
digital signal processor core;
transferring data to said local memory;
transmitting a command to said co-processor
thereby causing said co-processor to perform
a corresponding data processing operation in
parallel to said digital signal processor core and
store results in said local memory; and
transferring said results out of said local
memory of said co-processor.

11. The method of claim 10 wherein:

said step of transferring data to said local
memory comprises storing data in a next location in
a circularly organized memory area serving as an
input buffer.

12. The method of claim 10 or claim 11 wherein:

said step of storing results in said local
memory comprises storing data in a next location in a
circularly organized memory area serving as an
output buffer.

13. The method of claim 12, further comprising:

storing input data within a circularly organized
history buffer having a size corresponding to a time
extent of said corresponding data processing
operation substantially concurrently with said step of
storing results in said local memory.